The Lexicon Builder
Authors
Abstract
Center for Biomedical Informatics Research, Stanford University, Stanford, CA
Domain-specific biomedical lexicons are used extensively by researchers for natural language processing tasks. Currently these lexicons are created manually by expert curators, and there is a pressing need for automated methods to compile them. The Lexicon Builder Web service addresses this need and reduces the time and effort involved in lexicon maintenance. The service has three components: Inclusion selects one or several ontologies (or their branches) and includes preferred names and synonym terms; Exclusion filters terms based on each term's Medline frequency, syntactic type, UMLS semantic type, and match with stopwords; Output aggregates information and handles compression and output formats. Evaluation demonstrates that the service has high accuracy and runtime performance. It is currently being evaluated for several use cases to establish its utility in biomedical information processing tasks. The Lexicon Builder promotes collaboration, sharing, and standardization of lexicons among researchers by automating the creation, maintenance, and cross-referencing of custom lexicons.
Introduction and background
The analysis of the enormous amount of publicly available biomedical data requires the use of biomedical ontologies to structure and annotate datasets with controlled terms in order to facilitate search, retrieval, and data integration. Biomedical researchers routinely use ontologies and terminologies to annotate their data for better data integration and translational discoveries. The National Center for Biomedical Ontology (NCBO) builds tools and services to assist the biomedical community in using ontologies to annotate and analyze biomedical data. With the large number and variety (of formats and locations) of biomedical ontologies, choosing the right ontology for an annotation task or for designing a curation tool, or recommending an appropriate ontology for annotation, is a challenge. Increasingly, natural language processing (NLP) tools are used in the annotation of biomedical data as well as in curation pipelines. Even if the ontology to use in an NLP tool is identified and the tool has programmatic access to a large number of biomedical ontologies in the NCBO BioPortal, a significant amount of pre-processing is required to use existing ontologies effectively in natural language processing pipelines.
A lexicon (also called a dictionary) is a core component of any natural language processing system. For example, the SPECIALIST lexicon is a large syntactic lexicon of biomedical and general English. The use of lexicons derived from terminologies and ontologies for text mining and information extraction tasks is not new in the biomedical community. For example, the BioLexicon has been used in three text mining tasks: a) BLTagger, a dictionary-based part-of-speech (POS) tagger; b) the Enju full parser, enriched using the lexicon; and c) lexicon-based query processing for information retrieval. In another study, medication information was extracted from discharge summaries using parsing rules written as a set of regular expressions and a user-configurable drug lexicon; the authors acknowledge the necessity of careful lexicon selection for the extraction of drug information and of making the lexicon a configurable component in their system. The MedLEE lexicon was used to mine a clinical data warehouse for disease-finding associations; the authors mention that the MedLEE lexicon does not cover a large number of medical terms and that a lexicon with larger coverage would improve the discovered associations. It has also been acknowledged that an important class of named entity recognition approaches is lexicon-based and that high-quality lexicons are essential for improving F-measure (a combination of precision and recall) scores.
Basic text-mining resources, such as domain-specific thesauri and lexicons, need to be developed and shared across research groups and curation tasks in order to extend the depth as well as the breadth of the information that is curated, searched, and mined. Ontologies and terminologies together with lexicons are important for advanced text mining, and both are needed to produce the highly accurate results required by biomedical experts and to obtain broad coverage of biomedical text. It has been acknowledged that named entity recognition (NER) tasks require extensive domain-specific lexicons, which do not readily exist, and argued that custom, domain-specific lexicons are important background knowledge in medical language-processing systems.
The main motivation for developing the NCBO Lexicon Builder Web service is to allow users to create custom domain-specific lexicons for specific NLP, data mining, and information extraction tasks. For example, using our service, a researcher can compile a lexicon for identifying malignant skin tumors that spans multiple public ontologies. Currently, the creation of custom lexicons with biomedical ontology concepts is not a prevalent practice in the biomedical community, for several possible reasons:
• Creating custom lexicons requires a huge investment, and the accuracy and coverage of the resulting lexicons are often questionable;
• The large number of biomedical ontologies available for creating lexicons, coupled with the frequent changes and overlap in these ontologies, significantly increases the complexity;
• Integrating related concepts over multiple related ontologies without knowledge of the structure of the ontologies is difficult and error-prone, and limits the coverage of the lexicon.
The Lexicon Builder Web service automates the task of creating custom lexicons across multiple biomedical ontologies. The service leverages Medline analysis to produce lexicons with high accuracy and coverage.
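The three components named in the abstract (Inclusion, Exclusion, Output) can be pictured as a small filtering pipeline. The Python sketch below is only an illustration under stated assumptions, not the service's actual implementation or API: the record layout, the function names (`include`, `exclude`, `output`), the sample concepts, and the choice to drop terms above a Medline frequency threshold are all hypothetical, and syntactic-type filtering is left out for brevity.

```python
import gzip
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One candidate lexicon entry (hypothetical layout, for illustration)."""
    text: str
    concept_id: str
    ontology: str
    semantic_type: str


def include(concepts, ontologies):
    """Inclusion: collect preferred names and synonyms from the chosen ontologies."""
    terms = []
    for c in concepts:
        if c["ontology"] not in ontologies:
            continue
        for text in [c["preferred_name"], *c["synonyms"]]:
            terms.append(Term(text, c["id"], c["ontology"], c["semantic_type"]))
    return terms


def exclude(terms, medline_freq, stopwords, max_freq, blocked_types):
    """Exclusion: drop stopwords, very frequent terms, and unwanted semantic types.

    Assumption: a term is dropped when its (made-up) Medline count exceeds max_freq.
    """
    kept = []
    for t in terms:
        key = t.text.lower()
        if key in stopwords:
            continue
        if medline_freq.get(key, 0) > max_freq:
            continue
        if t.semantic_type in blocked_types:
            continue
        kept.append(t)
    return kept


def output(terms, compress=False):
    """Output: aggregate concept references per term and serialize as JSON bytes."""
    lexicon = {}
    for t in terms:
        lexicon.setdefault(t.text.lower(), set()).add(f"{t.ontology}:{t.concept_id}")
    ordered = {k: sorted(v) for k, v in sorted(lexicon.items())}
    payload = json.dumps(ordered).encode("utf-8")
    return gzip.compress(payload) if compress else payload


# Made-up sample data; identifiers and counts are not taken from any real ontology.
SAMPLE_CONCEPTS = [
    {"id": "C1", "ontology": "ONT_A", "preferred_name": "Melanoma",
     "synonyms": ["Malignant melanoma"], "semantic_type": "Neoplastic Process"},
    {"id": "C2", "ontology": "ONT_A", "preferred_name": "Study",
     "synonyms": [], "semantic_type": "Research Activity"},
    {"id": "C9", "ontology": "ONT_B", "preferred_name": "melanoma",
     "synonyms": ["Other"], "semantic_type": "Neoplastic Process"},
]
SAMPLE_MEDLINE_FREQ = {"melanoma": 120, "malignant melanoma": 40, "study": 900000}
```

Chaining the three functions, for example `output(exclude(include(SAMPLE_CONCEPTS, {"ONT_A", "ONT_B"}), SAMPLE_MEDLINE_FREQ, {"other"}, 100000, set()))`, yields a JSON lexicon in which each surviving term maps to the concepts it names across ontologies. Keeping the stages as separate functions mirrors the component split described in the abstract.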
Similar resources
A Tool Kit for Lexicon Building
This paper describes a set of interactive routines that can be used to create, maintain, and update a computer lexicon. The routines are available to the user as a set of commands resembling a simple operating system. The lexicon produced by this system is based on lexical-semantic relations, but is compatible with a variety of other models of lexicon structure. The lexicon builder is suitable ...
A Syntactic Lexicon for Arabic Verbs
In this paper, we present a model of the syntactic lexicon for Arabic verbs based on the Lexical Markup Framework. This ISO standard lets us describe the lexical information in a simple way using general guidelines and enables the sharing of resources that follow the standard. We discuss the syntactic information associated with verbs and the model we propose to structure and represent the entries...
A MWE Acquisition and Lexicon Builder Web Service
This paper describes the development of a web-service tool for the automatic extraction of multi-word expression lexicons, which has been integrated in a distributed platform for the automatic creation of linguistic resources. The main purpose of the work described is thus to provide a (computationally “light”) tool that produces a full lexical resource: multi-word terms/items with relevant an...
Building a Domain-Specific French-Korean Lexicon
The Korean government has adopted the French TGV as a high-speed transportation system, and the first service is scheduled for the end of 2003. TGV-relevant documents consist of huge volumes, of which more than 76% have been translated into English. A large part of the English version is, however, incomprehensible without referring to the original French version. The goal of this paper is to demon...
AUTOLEX: An Automatic Lexicon Builder for Minority Languages Using an Open Corpus
The aim of this study is to build natural language resources for languages with limited resources or minority languages. Manually building these resources is tedious and costly. These natural language resources, such as language corpora and lexicons, will be used for natural language processing research and system development. Tagalog, a minority language, was considered in this study as a test b...
Publication date: 2010